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Abstract 

Through the Non-EquiUbrium Green's Function (NEGF) formaUsm, quantum- 
scale device simulation can be performed with the inclusion of electron-phonon 
scattering. However, the simulation of realistically sized devices under the NEGF 
formalism typically requires prohibitive amounts of memory and computation time. 
Two of the most demanding computational problems for NEGF simulation involve 
mathematical operations with structured matrices called semiseparable matrices. 
In this work, we present parallel approaches for these computational problems 
which allow for efficient distribution of both memory and computation based upon 
the underlying device structure. This is critical when simulating realistically sized 
devices due to the aforementioned computational burdens. First, we consider de- 
termining a distributed compact representation for the retarded Green's function 
matrix G*. This compact representation is exact and allows for any entry in the 
matrix to be generated through the inherent semiseparable structure. The second 
parallel operation allows for the computation of electron density and current char- 
acteristics for the device. Specifically, matrix products between the distributed rep- 
resentation for the semiseparable matrix and the self-energy scattering terms 
in produce the less-than Green's function G^. As an illustration of the compu- 
tational efficiency of our approach, we stably generate the mobility for nanowires 
with cross-sectional sizes of up to 4.5nm, assuming an atomistic model with scat- 
tering. 

1 Introduction 

In the absence of electron-phonon scattering, the problem of computing density of 
states and transmission through the NEGF formalism reduces to a mathematical prob- 
lem of finding select entiies from the inverse of a typically large and sparse block 
tridiagonal matrix. Although there has been research into numerically stable and com- 
putationally efficient serial computing algorithms [l], when analyzing certain device 
geometries this type of method will result in prohibitive amounts of computation and 
memory consumption. In Q a parallel divide-and-conquer algorithm (PDIV) was 
shown to be effective for NEGF based simulations. Two applications were presented, 
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the atomistic level simulation of silicon nanowires and the two-dimensional simula- 
tion of nanotransistors. Alternative, serial computing, NEGF based approaches such 
as U I rely on specific problem structure (currently not capable of addressing atomistic 
models), with computation limitations that again restrict the size of simulation. In ad- 
dition, the ability to compute information needed to determine current characteristics 
for devices has not been demonstrated with methods such as [3 1. The prohibitive com- 
putational properties associated with NEGF based simulation has prompted a transition 
to wave function based methods, such as those presented in |4|. Given an assumed ba- 
sis structure the problem translates from calculating select entries from the inverse of a 
matrix to solving large sparse systems of linear equations. This is an attractive alterna- 
tive because solving large sparse systems of equations is one of the most well studied 
problems in applied mathematics and physics. In addition, popular algorithms such as 
UMFPACK |5| and SuperLU 16] have been constructed to algebraically (based solely 
on the matrix) exploit problem specific structure in an attempt to minimize the amount 
of computation. The performance of these algorithms for the wave function based anal- 
ysis of several silicon nanowires has been examined in [7 |. However, for many devices 
of interest wave function methods are currently unable to address more sophisticated 
analyses that involve electron-phonon scattering. Thus, there remains a strong need to 
further develop NEGF based algorithms for the simulation of realistically sized devices 
considering these general modeling techniques. 

When incorporating the effects of scattering into NEGF based simulation, it be- 
comes necessary to determine the entire inverse of the coefficient matrix associated 
with the device. This substantially increases both the computational and memory re- 
quirements. There are a number of theoretical results describing the structure of the 
inverses of block tridiagonal and block-banded matrices. Representations for the in- 
verses of tridiagonal, banded, and block tridiagonal matrices can be found in |f8UT3|. It 
has been shown that the inverse of a tridiagonal matrix can be compactly represented 
by two sequences {m,} and {v,} lfT4ljT7l . This result was extended to the cases of block 
tridiagonal and banded matrices in ifTsHSOl . where the {m,} and {v,} sequences gener- 
alized to matrices {f/,} and {V,}. Matrices that can be represented in this fashion are 
more generally known as semiseparable matrices 11201 12 11 . Typically, the computation 
of parameters {m,} and {v,} suffers from numerical instability even for modest-sized 
problems |22|. It is well understood that for matrices arising in many physical appli- 
cations the {m,} and {v,} sequences grow exponentially II17II23 1 with the index /. One 
approach that has been successful in ameliorating these problems, for the tridiagonal 
case, is the generator approach shown in ll24l . Here, ratios for sequential elements of 
the {m,} and {v,} sequences are used as the generators for the inverse of a tridiagonal 
matrix. Such an approach is numerically stable for matrices of very large sizes. The 
extension of this generator approach to the general block-tridiagonal matrices was dis- 
cussed by the same authors in |13|. The authors used the block factorization of the 
original block-tridiagonal matrix to construct a block Cholesky decomposition of its 
inverse. 

A generator based approach for inversion, typically referred to as the Recursive 
Green's Function (RGF) algorithm, was introduced in [IJ for NEGF based simulation. 
It is important to note that the method of |[T] requires 3 x less memory to compute only 
the density of states and transmission (in the absence of scattering), when compared 
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to the complete generator representation. In this work, we extend the approach of ||2] 
to consider the computation of a distributed generator representation for the inverse 
of the block tridiagonal coefficient matrix. We then demonstrate how the distributed 
generator representation allows for the efficient computation of the electron density and 
current characteristics of the device. Our parallel algorithms facilitate the simulation 
of realistically sized devices by utilizing additional computing resources to efficiently 
divide both the computation time and memory requirements. As an illustration, we 
stably generate the mobility for4.5nm cross-section nanowires assuming an atomistic 
model with scattering. 



2 Inverses of Block Tridiagonal Matrices 

A block-symmetric matrix K is block tridiagonal if it has the form 
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where each A,-,Bi e Thus K e M^- A'vxA^jy,^ ^jjjj diagonal blocks of size Nx 

each. We will use the notation K = tn{Ai-N^,,Bi-N^,^i) to represent such a block tridi- 
agonal matrix. The NEGF based simulation of nanowires using the sp3d5s* atomistic 
tight-binding model with electron-phonon scattering has been demonstrated in ||251 . 
The block tridiagonal coefficient matrix for simulation is constructed in the following 
way: 

^:=(£/-//-E^-Ef-Ef). 

Here, E is the energy of interest, H is the Hamiltonian containing atomistic interactions, 
and , E^, and E| are the left and right boundary conditions and self energy scattering 
terms respectively. In order to calculate the current characteristics for the device we 
must first form the retarded Greens Function using the fact that KG^ = I. A standard 
numerically stable mathematical representation for the inverse of this block tridiago- 
nal matrix is dependent on two sequences of generator matrices {g^},{gf }■ Here, 
the terms ^ and ^ correspond to the forward and backward propagation through the 
device. Specifically, we can use the diagonal blocks of the inverse {£), } and the gener- 
ators to describe the inverse a block tridiagonal matrix K in the following manner: 
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Where the diagonal blocks of the inverse, D,, and the generator sequences satisfy the 
following relationships: 
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i^l,...,Ny-2, 



The time complexity associated with determining the parametrization of by the 
above approach is 0{NyNy), with a memory requirement of 0{NyNy). 



2.1 Alternative Approach for Determining the Compact Represen- 
tation 

It is important to note that if the block tridiagonal portion of is known, the generator 

^ R 

sequences g and can be extracted directly, i.e. without the use of entries from K 
through the generator expressions (|3). Examining closely the block tridiagonal portion 
of we find the following relations: 

(4) 

gf = Q,D;\ i^i,...,Ny~i, 

where f , denotes the (/, / + 1 ) block entry of G^ and g, denotes the (/ + 1 , /) block entry 
of G'^. Therefore, by being able to produce the block tridiagonal portion of G^ we have 
all the information that is necessary to compute the compact representation. 
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As was alluded to in Section[Tl direct techniques for simulation of realistic devices 
often require prohibitive memory and computational requirements. To address these 
issues we offer a parallel divide-and-conquer approach in order to construct the com- 
pact representation for G*, i.e. the framework allows for the parallel inversion of the 
coefficient matrix. Specifically, we introduce an efficient method for computing the 
block tridiagonal portion of in order to exploit the process demonstrated in (3). 

3 Parallel Inversion of Block Tridiagonal Matrices 

The compact representation of G^ can be computed in a distributed fashion by first 
creating several smaller sub-matrices (|),-. That is, the total number of blocks for the 
matrix K are divided as evenly as possible amongst the sub-matrices. After each in- 
dividual sub-matrix inverse has been computed they can be combined in a Radix-2 
fashion using the matrix inversion lemma from linear algebra. Figure [T] shows both 
the decomposition and the two combining levels needed to form the block tridiagonal 
portion of G^, assuming K has been divided into four sub-matrices. In general, if K is 
separated into p sub-matrices there will be log p combining levels with a total of p—\ 
combining operations or "steps". The notation (^J^j is introduced to represent the result 
of any combining step, through the use of the matrix inversion lemma. For example, 
(|)j^2 is inverse of a matrix comprised of the blocks assigned to both (|)i and (|)2. It is 
important to note that using the matrix inversion lemma repeatedly to join sub-matrix 
inverses will result in a prohibitive amount of memory and computation for large simu- 
lation problems. This is due to the fact that at each combining step all entries would be 
computed and stored. Thus, the question remains on the most efficient way to produce 
the block tridiagonal portion of G^, given this general decomposition scheme for the 
matrix K. 

In this work, we introduce a mapping scheme to transform compact representations 
of smaller matrix inverses into the compact representation of G^. The algorithm is 
organized as follows: 

• Decompose the block tridiagonal matrix K into p smaller block tridiagonal ma- 
trices. 

• Assign each sub-matrix to an individual CPU. 

• Independently determine the compact representations associated with each sub- 
matrix. 

• Gather all information that is needed to map the sub-matrix compact representa- 
tions into the compact representation for G^. 

• Independently apply the mappings to produce a portion of the compact represen- 
tation for G^ on each CPU. 

The procedure described above results in a "distributed compact representation" allow- 
ing for reduced memory and computational requirements. Specifically, each CPU will 
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Decomposition Sub-matrix Inversion Combining Level: 1 Combining Level: 2 Compact Representation 




Figure 1 : Decomposition of block tridiagonal matrix K into four sub-matrices, where 
the shaded blocks correspond to the bridge matrices. The two combining levels follow 
the individual sub-matrix inversions, where ^J^j represents the inverse of divisions (|), 
through (Sfj from the matrix K. Matrix mappings will be used to capture the combining 
effects and allow for the direct computation of the block tridiagonal portion of G^. 



eventually be responsible for the elements from both the generator sequences and di- 
agonal blocks that correspond to the initial decomposition (e.g. if (|)i is responsible for 
blocks 1,2, and 3 from the matrix K, the mappings will allow for the computation of 

~S ^ 
^1 .3' ^L..3' andZ)i...3). 

In order to derive the mapping relationships needed to produce a distributed com- 
pact representation, it is first necessary to analyze how sub-matrix inverses can be 
combined to form the complete inverse. Consider the decomposition of the block 
tridiagonal matrix K into two block tridiagonal sub-matrices and a correction term, 
demonstrated below: 



^ V ' 

K 

(|)i =tri(Ai;;t,fii;;t_i), <^2 = ^r^(Ak+\:Ny,Bk+]_:Ny-i), and 



/O •■• -Bl ••• OV [0 ••• / ••• 0\ 

\p ■■■ -Bk ■■■ Oj ' \p ■■■ I ■■■ Oj' 

Thus, the original block tridiagonal matrix can be decomposed into the sum of a block 
diagonal matrix (with its two diagonal blocks themselves being block tridiagonal) and 
a correction term parametrized by the A'^ x A^^ matrix B^, which we will refer to as the 
"bridge matrix". Using the matrix inversion lemma, we have 

= (K + XY)-^ =K^ - (K-^X) (l + YR-^xy^ (YK-^), 
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where 



-c^2'(:,l)B[^ 



I -c^2'(l,l)B/, 

^2 '0,1) 

<^i'i:,kf 



and (|)[ {:,k) and (|)2 ' (:, 1) denote respectively the last and first block columns of (|)[ 
and ' ■ 

This shows that the entries of K^^ are modified through the entries from the first 
rows and last columns of (^^^ and (jjj as well as the bridge matrix Bk- Specifically, 
since is before or "above" the bridge point we only need the last column of its in- 
verse to reconstruct G^. Similarly, since (^2 is after or "below" the bridge point we only 
need the first column of its inverse. These observations were noted in [2 |, where the 
authors demonstrated a parallel divide-and-conquer approach to determine the diago- 
nal entries for the inverse of block tridiagonal matrices. We begin by generalizing the 
method from [2| in order to compute all information necessary to determine the dis- 
tributed compact representation of (O. That is, we would like to create a combining 
methodology for sub-matrix inverses with two major goals in mind. First, it must al- 
low for the calculation of all information that would be required to repeatedly join 
sub-matrix inverses, in order to mimic the combining process shown in Figure [T] Sec- 
ond, at the final stage of the combining process it must facilitate the computation of the 
block tridiagonal portion for the combined inverses. Details pertaining to the parallel 
computation of are provided in AppendixlAl The time complexity of the algorithm 
presented is 0{N^Ny/ p+N^ log p), with memory consumption O [N^Ny/ p +N^) ■ The 
distribution of the compact representation is at the foundation of an efficient parallel 
method for calculating the less-than Green's Function G^ and greater-than Green's 
Function G^. 



4 Parallel Computation of the Less-than Green's Func- 
tion 

The parallel inversion algorithm described in Section |3] not only has advantages in 
computational and memory efficiency but also facilitates the formulation of a fast, and 
highly scalable, parallel matrix multiplication algorithm. This plays an important role 
during the simulation process due to the fact that computation of the less-than Green's 
Function requires matrix products with the retarded Green' s Function matrix; 



G< = G^E<G**. (6) 

E^, which we will refer to as the less-than scattering matrix, is typically assumed to be 
a block diagonal matrix. We will demonstrate how the distributed compact representa- 
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tion of the semiseparable matrix presented in Section|3]can be used to calculate the 
necessary information from G^. Specifically, the electron density for the device will 
be calculated through the diagonal entries of and the current characteristics through 
the first off-diagonal blocks of G^. 

4.1 Mathematical Description 

Recall that our initial state for this procedure would assume that portions of the block 
tridiaongal (corresponding to the size and location of the divisions) of G^ have been 
calculated and stored. It is important to note that there are many generator represen- 
tations for G* and we would like to select a representation that will facilitate efficient 
calculation of G^. For the mathematical operation shown in (|6), our starting point will 

be describing the A:* block row of G^ in terms of D, , (^gf^ ^ , and g^, the diagonal 
blocks and two generator sequences respectively. The following expressions are used 
to determine the generators from the block tridiagonal of the semiseparable matrix: 

Qi^{gfyDi^[gfy=Q,{Dt)-\ (7) 

Di, Pi, and Qi, are the diagonal, upper diagonal, and lower diagonal blocks respec- 
tively. Thus, the generators (^8^^ ^"d g^ are used to describe the A;* block row of 
semiseparable matrix in the following way: 



G«(^,:) = 

\^i—Ic 1 / — li / 

This generator representation for G* along with the block diagonal structure of E< al- 
lows for us to express each diagonal block of G^ in terms of recursive sequences. Both 

the forward recursive sequence gf and backward recursive sequence gY are depen- 
dent on a common sequence of injections terms 7,- (note: the arrow orientation for g^ 
matches that of g^). The relationships between the sequences and the diagonal blocks 
of G^ are shown below: 
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Similar serial recursions have been shown in JT] and ll26l . Our strategy in this work 
is to exploit the distributed compact representation of G'^ in order to produce these 
sequences efficiently. That is, we will divide the computation needed to calculate the 

three sequences 7,, , and g^. into several sub-problems that can be efficiently sepa- 
rated across many processors. 

As motivation, if we assume that there are A'^. — 2N blocks, we can define two sub- 
problems by separating = E^'' + E^'^, where E"^"' contains blocks 1 through and 
E*^'^ contains blocks + 1 through 2N. Then, we can write: 

G< = G''E<G''* = G' (E<'' +E<'2) G' * = G''T.<-''G''* + G''T.<-^G''* = G<'-^+G<''^. 

For this example, we have assumed an equal separation of the diagonal blocks for the 
scattering matrix E^, i.e. EJ^'' ~ 0, V/ > A^. Thus, for the first sub-problem we have: 
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i = N+l,2,...,2N 



Jr.i + [8r-i] 8i-ui(8r-i) , i = 2,...,N 
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i=N+\,...,2N 



4=0, i^2N,...,N+l 
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8ti^J':l+8r8nuA8r) > i = N-2,...,l 



We then see the following relationships for the second sub-problem: 



7,;2=0, ; = 1,2,...,A? 

/,;2 = D.^f'^D* , i=N+l,2,...,2N 



g<2=0, /=1,...,A^ 
8% = Jr,2 + f^Hi) %Ei;2 f^Hi)^ , i = N + 2,...,2N 



< 

< < / K \ * 



< R < / R\* 

t2=Jr,2+8r8r+l;2{sr) , / = 2A'-1,...,A'+1 

< R < / -S \ * 



The recursions for the case of two sub-problems are illustrated in Figure |2] with A'^ = 
2N — 6. Here, the shaded terms in Figure [2(a)] and Figure |2(b)] repre sent the terms for 
each sub-problem that will be computed on CPU 1 and CPU 2 respectively. In sum- 
mary, the separation of the diagonal blocks, the generator sequences, and the scattering 
matrix evenly across two computers will result in the following order of operations: 
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(a) The non-zero matrices that need to be computed 
for sub-problem I . The terms that are computed on 

CPU 1 are shaded. 



S6;2 



g4 
R 

R 

R 



8372 





■'6;2 








^5:2 






h;2 







g4l2 



R 

^ §4 



(b) The non-zero matrices that need to be computed 
for sub-problem 2. The terms that are computed on 
CPU 2 are shaded. 



Figure 2: Distribution of computation into two sub-problems across two CPUs. 
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CPUl 



CPU 2 



Stage I Ji.i=DiL<'^D*, 
i=l,2,---,N 



i = N+l,2,...,2N 



Stage II 



i = 2,...,N 



Stage III 8t2=8rsUu2[Si~) 



i=N,...,l 



82N;2 - -f^N;! 
< R < / R\* 

St2=h2+gtsUh2\8t) ' 

i = 2N-l,...,N+l 
i=N+l,...,2N 



Stage IV = Jn-i 

i=N-2,...,l 



8n+U2 =Jn+U2 
g%=h2 + {gh)\tl;2{stl)'' , 

i=N + 2,...,2N 



If we consider multiplication to be of order N^, then on a single processor the 
calculation would require 3 x {2Ny)N^ = 6NyN^ operations. In this case we have four 
stages, each with 2 x ^A^^ = NyN^ operations. Therefore, using two processors we 
have reduced the number of multiplications to 4NyN^. We now generalize this process 
and demonstrate further speed-up as both the number of processors and the length of 
the device increase. 

4.2 Parallel Implementation 

In order to simplify the presentation of the method, we will assume that the number 
of sub-problems p evenly divides the total number of blocks Ny = pN. In general 
this assumption is not required. We will separate the G"^ operation evenly into p sub- 
problems by dividing Z< = -|- + ...l/^'P, where Z^-*^ contains the portion 
of the less-than scattering matrix. Given that l.f''^ = 0, V; > ^A^ and i <{k— V)N -\- 1, 
the A:"* sub-problem will be solved through the following recursions: 
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gfk 



4 
4 



^ J,:k + 8hti;k (^f ) * ' i^kN-l,...,{k-l)N+l 



So, in a sense by splitting the scattering matrix we have divided the injection sequence / 

evenly and created two new sub-problem propagating sequences gf^ and gp^. However, 
the sub-problem propagating sequences have the special property that they do not start 
(are zero) until they reach the indices governed by the sub-problem. This is due to the 
fact that there are no injection terms for the given sub-problem until these points are 
reached. In addition, once outside the range of the sub-problem we do not have any 
additional injection terms in the recursions. Thus, we have a standard two-sided auto- 
regressive expression where the terms are simply propagated by multiplication with 
generator matrices. 

The recursions for the case of four sub-problems are illustrated in Figure |3]for an 
example with A'j, = 4N =12. Here, the shaded terms in Figure [3(a)] and Figure [3(b)] 
represent the terms for each sub-problem that are computed on CPU 1 and CPU 4 re- 
spectively. For each sub-problem we introduce two new sequences that will be referred 



to as "skip matrices". Specifically, we define for each processor k two matrices 



and 



k 

that are accumulations of the generator terms stored on the processor As 



can be clearly seen from Figure |3] these skip matrices allow for several steps in the 
recursive process to be preformed through a single operation. Therefore, if the skip 
matrices are made available to each processor several terms for each sub-problem can 

be found concurrently. In Figure [3(b)] we can see that skip matrices will allow for g^^^, 

< < 

g^^, and to be determined currently by CPUs 3, 2, and 1 respectively. 
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(a) The non-zero matrices that need to be computed 
for sub-problem 1. The terms that are computed on 
CPU 1 are shaded. 
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(b) The non-zero matrices that need to be computed 
for sub-problem 4. The terms that are computed on 
CPU 4 are shaded. 



Figure 3: Distribution of computation into four sub-problems across four CPUs. 



Given that each sub-problem will produce both a forward and backward propa- 
gating sequence it should be clear that CPU k will need to preform computation for 
g"^ sequences from sub-problems {k+ l,k + 2,...,p} and sequences from sub- 
problems {1,2,. 1}. However, each g"^ sequence and g^ sequence from other 
sub-problems may be combined before the generators on that CPU are applied. This 

is due to the fact that the sequences from each sub-problem will eventually be added 

p 

together to form = Y, G^'^- For the example shown in Figure |3]we see that CPU 

k=l 

1 will first need to form: 
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before the required terms for each sub-problem can be calculated. That is, after the lead 

/ < < < \ 

term for the backward propagating sequence: I g"^^ + ^473 + ^472 1 has been computed, 

R R R 

the generators governed by CPU 1: ^3", g^ , and g\~, can be applied to fulfill the 
sub-problem solutions. We now have all the tools necessary to construct a general 
procedure: 



1. Compute the skip products of each generator sequence g and g<-, i.e. com- 



pute 



kN R 

= n 8t and 

* i=(k-l)N+\ 

8n,. 



-I*: !=A-A'-1 



n [8}) 



where we will define 



2. Compute the initial injection and propagation terms for each sub-problem: 

i={k-l)N+l,...,kN 



hk+{8h)\tl;k (^Hl)' 



, = (A:-1)A?+1 
i={k-\)N + 2,...,kN 
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4 



■ Ji;k, 



R^ < 



i = kN 

i^kN-l,...,{k-l)N+l 



3. Transfer forward and backward skip products: 



and 



as well as 



foi-ward and backward lead propagating matrices: g^.^ and 8(k^i)M+i-k' 
sub-problem k to all CPUs. This will be a total of ApNx^ entries. 



for each 
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4. CPU k will construct combined lead propagating term for g*^ sequences from 
sub-problems 1, A: + 2, . . .,/:>}. 



P < < P ( 

H 8kN+l;j ~ 8kN+l;k+l~^ H I PI 
j=k+l j=k+2 \l=k+l 



' k+i 

n 



CPU k will construct combined lead propagating term for g'^ sequences from 
sub-problems {1,2, . . . ,k — 1}. 



k-l 



k-2 / j+l 



H 8(k-l)N;j — 8{k-l)N;k-l + 12 11 



^ j8jN+l-j 



' k-\ 
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5. Successively multiply by the governed generator terms g^ and g in order fulfill 
sub-problem solutions. 

j=k^\ \l=i j i=k+\ \l=kN J 

i^kN,...,{k-l)N+l. 



k-l ^ / {k-l)N j\ k-l / (i-i) , c\ 

L8^-( n ^0 n {81') I 

J=l \/=(!-l) / .7=1 V/=(*-l)A7 / 

, = (A:-l)A^+l,...,feA^. 

6. Each CPU will combine portions of all sub-problem solutions in order to produce 

corresponding portion of the diagonal blocks of based upon the relationships 
shown in ([8]l. 



In order to analyze the computational improvement for the approach we consider 
the number of multiplications required for each stage. The accumulation of the skip 
matrices described in Step 1 will result in IN^Ny/ p multiplications. The individual 
sub-problem recursions of Step 2 results in 6N^Ny/p multiplications. Step 4 describes 
how the skip matrices may be applied in order to create the sum for each sub-problem 
propagating sequence, requiring 2N^{p — 2) multiplications. Finally, these two sums 
must be propagated through each governed generator matrices, requiring 4-N^Ny/p 
multiplications. If one is interested in also computing the current characteristics for 
the device, they may be determined through the off-diagonal blocks of : 



G<ii,i+i)^gt^i[gfy +{gfy gf, i^i,...,Ny-L (9) 

As portions of each generator sequence and propagating sequence have been evenly 
distributed amongst the processors, the resulting computation would be 2N^Ny/p. There- 
fore, the total number of multiplications for the approach is 2N^{p — 2) + lAN^Ny/ p 



16 





2nm cross-section 


3nm cross-section 


4nm cross-section 


p 


4 1 8 1 16 1 32 


4 1 8 1 16 1 32 


4 1 8 1 16 1 32 




35nm length 




35.8 


21.5 


16.2 


N/A 


411.1 


241.4 


163.6 


N/A 


1915.5 


1099.7 


726.1 


N/A 


G< 


22.7 


12.7 


10.0 


N/A 


253.7 


133.7 


81.2 


N/A 


1159.2 


598.5 


349.7 


N/A 


RGFx 


1.5 


2.7 


3.4 


N/A 


1.4 


2.6 


4.1 


N/A 


1.4 


2.6 


4.2 


N/A 




53nm length 




51.4 


29.3 


19.7 


N/A 


594.4 


331.9 


211.4 


N/A 


2756.1 


1523.4 


946.4 


N/A 


G< 


33.5 


18.5 


13.8 


N/A 


380.9 


198.5 


113.8 


N/A 


1741.6 


887.8 


489.1 


N/A 


RGFx 


1.6 


2.8 


4.0 


N/A 


1.5 


2.7 


4.5 


N/A 


1.4 


2.7 


4.6 


N/A 




70nm length 


G" 


67.1 


38.8 


23.8 


17.7 


779.0 


422.6 


255.0 


181.3 


3601.3 


1953.7 


1154.5 


782.1 


G< 


44.6 


24.4 


15.8 


12.9 


507.6 


261.5 


145.5 


103.4 


2324.2 


1182.2 


640.2 


408.9 


RGFx 


1.6 


2.8 


4.5 


5.7 


1.5 


2.8 


4.8 


6.8 


1.4 


2.7 


4.9 


7.4 



Table 1 : Runtime comparisons between parallel G and algorithms and serial RGF 
approach. Silicon nanowires with lengths 35nm, 53nm, and 70nm are examined with 
cross-sections of 2nm, 3nm, and 4nm. "RGFx" is the observed speed-up for the com- 
bined time. "N/A" is used for devices that are too short to support the number of CPUs. 



compared to 6N^Ny for the serial algorithm. Therefore, the speedup of our approach is 

P 



1 M I P(P-2) 



(10) 



We can clearly see that if Ny» p we will approach a speedup of p/23A. 



5 Results 

The parallel and algorithms have been implemented in C using MPI for inter- 
processor communication. All computational complexity analyses were performed on 
a cluster of Intel E5410 processors with 16GB of shared memory for the 8 core ma- 
chines. The OMEN simulator ||251 was used to perform simulations for square silicon 
nanowires employing the sp3d5s* atomistic tight-binding model with electron-phonon 
scattering. We will begin by demonstrating the computational efficiency of our ap- 
proach. We will then analyze device characteristics for large cross-section nanowires 
that have prohibitive memory requirements for the serial RGF approach. 

Each NEGF nanowire simulation requires thousands of G^, G^, and G-^ compu- 
tations. For our approach, as well as RGF, the time for each computation will remain 
constant given that the underlying structure of the Hamiltonian and scattering matrices 
does not change. In order to analyze the computational benefits of our approach we 
have examined several different cross-section sizes and lengths of nanowires. Specifi- 
cally, silicon [100] nanowires with lengths 35nm, 53nm, and 70nm are examined with 
cross-sections of 2nm, 3nm, and 4nm. Table [T] shows the runtime for the G* and G^ 
computations (the time needed for G^ is identical to that of G^). "RGFx" is the ob- 
served speed-up for the combined time of G^, G^, and G^ calculations compared to 
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(a) speed-up is compared against tlieoretical estimate from l llOt for 50nm length 
silicon nanowires. 



12r 



10 



§- 8 
I 

T3 
CD 
CD 

C/) " 






Estimate 




2x2x1 OOnm^ 


-X- 


3x3x1 OOnm^ 



5 10 15 20 25 30 
Number of CPUs 

(b) speed-up is compared against theoretical estimate from dlOt for lOOnm length 
silicon nanowires. 



Figure 4: Verification of theoretical estimate for RGF speed-up considering 50nm 
and lOOnm length silicon nanowires. 
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1.5 2 2.5 3 3.5 4 4.5 
Diameter (nm) 



Figure 5: Phonon-limited electron mobility /j,,/, as function of the diameter of [100] ori- 
ented circular nanowires at room temperature. The electron density along the channel 
is homogeneous and set to « = lelOcm^^. 



RGF. "N/A" is used to for devices that are not long enough to be divided based upon the 
number of CPUs. We can clearly see that the efficiency of our algorithm improves as 
both the length and cross-section of the device increase. There are two reasons behind 
this trend. First, considering a fixed number of processors a longer device will devote 
less of the total time to sub-problem combining. In addition, as the cross-section size 
increases the inter-processor communication costs will be a smaller fraction of the total 
simulation time. We see these effects again when examining Figure |4] Here, we verify 
the accuracy of the estimated scaling trend that was derived in (fTOt . The speed-up 
over RGF for both 50nm and lOOnm nanowires are compared against our theoretical 
estimate. It is important to note that although the speed-up estimate ( fTol i is independent 
of the cross-section size, effects such as data access time, vector scaling/addition, and 
inter-processor communication will play a role in determining the efficiency. Thus, 
in both Figures |4(a)| and [4(b)| we again see improved efficiency when considering the 
larger 3nm cross-section nanowires. In the case of the 3 x 3 x lOOnm^ nanowire we 
achieve 11 .8 x speed-up when utilizing 32 CPUs. 

In addition to providing computational improvements, our algorithm facilitates the 
simulation of larger cross-section devices. If we consider the 4nm cross-section de- 
vices analyzed in Table [T] the RGF method would require between 16GB and 32GB 
of memory for lengths of 35nm to 70nm. These devices would not be able to be ana- 
lyzed without the use of special purpose hardware. As an illustration of the capacity 
for our algorithm to simulate devices previously viewed to have prohibitive memory 
requirements, we stably generate the mobility for 4.5nm cross-section devices with 
electron-phonon scattering. In order to facilitate complete nanowire simulations we 
have implemented our algorithm on dual hex-core AMD Opteron 2435 (Istanbul) pro- 
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cessors running at 2.6GHz, 16GB of DDR2-800 memory, and a SeaStar 2+ router 
As an application, the low-field phonon-limited mobility of electrons /jp/, is calculated 
in circular nanowires with diameters ranging from 1 .5 to 4.5nm and transport along 
the [100] crystal axis. The "dR/dL" method ||27l and the same procedure as in ||28l 
are used to obtain /jpi,. The channel resistance "R" is computed as function of the 
nanowire length "L" and then converted into a mobility. Here, "L" is set to 35nm, "R" 
is computed in the limit of ballistic transport and in the presence of electron-phonon 
scattering, and the difference between these two points is considered to evaluate dR/dL. 
The results are shown in Figure |5] From a numerical perspective, the computation of 
each Green's Function, at a given energy, was parallelized on 16 CPUs for all the device 
structures. As was alluded to above the simulation of these structures would not have 
been possible (due to memory restrictions) without the decomposition of the device 
through our parallel methods. 

6 Conclusions 

In this work we have developed algorithms for parallel NEGF simulation with scat- 
tering. The computational benefits of our approach have been demonstrated on large 
cross-section silicon nanowires. We show improvements of over 1 1 x for the and 
computations. In addition, our approach enables simulations without the need for 
special purpose hardware. This can best be observed through our simulation results 
for 4.5nm cross-section silicon nanowires. The algorithms developed in this work 
are applicable for a wide range of device geometries considering both atomistic and 
effective-mass models. In addition to offering significant computational improvements 
over the serial Recursive Green's Function algorithm, our approach facilitates simula- 
tion of realistically sized devices on typical distributed computing hardware. 
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A Parallel Computation of the Retarded Green's Func- 
tion 

In Appendix [Al we build upon the approach of |2| to consider the computation of a 
distributed generator representation for G^. In Section FaTI we introduce a mapping 
framework for combining sub-matrices corresponding to subsets of the device struc- 
ture. This has similarities to the approach presented in [2 1, where in that case only the 
diagonal entries of were needed to describe the density of states. Section [ASj illus- 
trates how a recursive process can be formulated in order to reconstruct the compact 
representation of G^. This involves several key differences when compared to [21 as 
significantly more information is required from G^. Finally, in Section [A3] the parallel 
algorithm is summarized including an analysis of the computational complexity. 
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A.l Matrix Maps 

Matrix mappings are constructed in order to eventually produce the block tridiagonal 
portion of while avoiding any unnecessary computation during the combining pro- 
cess. Specifically, we will show that both the boundary block entries (first block row 
and last block column) and the block tridiagonal entries from any combined inverse 
(^^^- must be attainable (not necessarily computed) for all combining steps. We begin 
by illustrating the initial stage of the combining process given four divisions, where for 
simplicity each will be assumed to have blocks of size A'^, - First, the two sub-matrices 
and (|)2 are connected through the bridge matrix and together they form the larger 
block tridiagonal matrix (|)i^2- By examining Figure [T]it can be seen that eventually 
(|)J^^2 ^iid (|)3^4 will be combined and we must therefore produce the boundaries for 
each combined inverse. From (|5]l the first block row and last block column of (|)[^, can 
be calculated through the use of an "adjustment" matrix: 



as follows: 



/ -^2i(l,l)Br ' 

{N,N)Bn I 



T 



(11) 



V [-*1'2 (^'1)^^-^21(1)2 (1,:)J / 

In addition, the r* diagonal block of (|)j^2 calculated using the following rela- 

tionships for r <N : 



(if-i2ir,r)=(ifYHr,r)-{~(^-\r,N)BNJn<^^\r,Ny), 

(12) 

<^^i2i'- + N,r + N) =<^^\r,r) - {-,^-\r,l)B],J2i^^\l,r)))\ 

where the off-diagonal block of (|)j^2 be calculated using the following relation- 
ships: 
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(13) 

r<N, 

riUr,r+ 1) = 0- (-^^i(r,iV)fiiv^ii^2 ' 
r = N. 

The combination of (|)3 and (|)4 through the bridge matrix Bt,n resuhs in similar rela- 
tionships to those seen above. Thus, in order be able to produce both the boundary and 
block tridiagonal portions of each combined inverse we assign a total of twelve Nx x A^^ 
matrix maps for each sub-matrix k. Mt,\_4 describe effects for the fc* portion of the 
boundary, Mj^-^^^ describe the effects for a majority of the tridiagonal blocks, while 
Q;i-4. which we will refer to as "cross" maps, can be used to produce the remainder 
of the tridiagonal blocks. 

Initially, for each sub-matrix ; the mappings M^-i = /, ^=1,4, with all remaining 
mapping terms set to zero. This ensures that initially the boundary of ^J^^ matches 
the actual entries from the sub-matrix inverse, and the modifications to the tridiagonal 
portion due to combining are all set to zero. By examining the first block row, last 
block column, and the tridiagonal portion of the combined inverse (|)j^2 we can see 
how the maps can be used to expUcitly represent all of the needed information. The 
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Ml.2 M2;2 




M3., M4J 



M3;2 M4;2 

Figure 6: Mapping dependencies when combining (^'[^ and to form (j)i^2- 



governing responsibilities of the individual matrix maps are detailed below: 



(^1^2(1:0= I [Mi;2^2i(l,:)+M2;2C^2H:,A^)' 



[M3;2*2'(l-)+A^4;2^2'0,^)' 



v 



(^;'ir,N)MTj<^^\l,s)+<^-\r,N)M,.i<^-\s,Nf], 



(14) 



(^7i2('--^+^) = -[«^r'('-,l)Cl;lC^2^(l,^)+^r'('-:l)C"2;lC^2H^,A^)^ + 



^riz ('• + + ^) - *2 ' ^) - [^2 ' 1 )^5;2^2 ^ 1 , ^ ) + (^2 ' 1 )^6;2^2 ' ('^: ^) ^ ^ 
(^2^^'-, A^)A^7;2C^2 ' (1 , ^) + ^2 ' ('•, A^)M8;2C^2 ' A^)^] , 



r,s<N, 

It is important to note that all of the expressions ([TTl-lfTjt can be written into the matrix 
map framework of (fl4] i. Figure |6] shows the mapping dependencies for the first block 
row and last block row (or column since K is symmetric). From (fTTt we see that 
both of the block rows are distributed based upon the location of each sub-matrix with 
respect to the bridge point, i.e. the mapping terms associated with (^^^ can be used to 
produce the first portion of the rows while those associated with (^2^ can be used for 
the remainder. In fact, this implicit division for the mapping dependencies holds for 



25 



the block tridiagonal portion of the combined inverses as well, enabling an efficient 
parallel implementation. Thus, from this point we can deduce that the matrix maps for 
the first block row (fT4t must be updated in the following manner: 

M2;l ^ M2:l + {(^i\hN)BNJn)M4;i; 

^(^r'(l'^Wii)^i;2; 

In order to understand these relationships it is important to first recall that the up- 
dates to the maps associated with sub-matrix are dependent on the last block column 
(j)i"'(:,A^). Thus, we see a dependence on the previous state for the last block column 

(:,A^), i.e. the new state of the mapping terms Mi-j and M2;i are dependent on the 
previous state of the mapping terms M^-i and M4-i respectively. Similarly, a depen- 
dence on (|)2 '(1,:) results in the new state of the mapping terms Mi;2 and M2;2 being 
functions of the previous state of the mapping terms Mi-2 and Mi;2 respectively. Fi- 
nally, although some of the mapping terms remain zero after this initial combining step 
(M2;2 for example), the expressions described in (fl4l need to be general enough for 
the methodology. That is, the mapping expressions must be able to capture combining 
effects for multiple combing stages, regardless of the position of the sub-matrix with 
respect to a bridge point. For example, if we consider sub-matrix (^2 for the case seen 
in Figure [U during the initial combining step it would be considered a lower problem 
and for the final combining step it would be considered a upper problem. Alternatively, 
sub-matrix (|)3 would be associated with exactly the opposite modifications. It is impor- 
tant to note that every possible modification process, for the individual mapping terms, 
is encompassed within this general matrix map framework. 

A.2 Recursive Combining Process 

In order to formalize the notion of a recursive update scheme we will continue the ex- 
ample from Section lA.ll By examining the final combing stage for the case of four 
divisions, we notice that the approach described in (fTTli-(fT3]l can again be used to com- 
bine sub-matrix inverses (|)j^2 and (^^^^, through the bridge matrix B2n- The first block 
row and last block column of (|) can be calculated as follows: 




given the adjustment matri 
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In addition, the r* diagonal block of <^^\ can be calculated using the following 
relationships: 

(16) 

r<2N, 

where the off-diagonal block of can be calculated using the following relation- 
ships: 

'^iU'-^r+l) = <^ii2i'-^r+ 1) - (-^i-i2(r,2A?)B2iv/i2^ri2('-+ 1.2A?)^) ' 

(17) 

<^i^,{r + 2N,r+l+2N)=(^^i,{r,r+l)-{-<^-l^{r+l,l )B ^21 ^3^4 ( 1 . ) ) ^ 
r<2N, 

^iUr,r+l)^0~{~(^-^^{r,2N)B2NJn<^^UlA)) , 
r^2N. 

Again, it is important to note that each of the expressions (flSll-dTTll are implicitly di- 
vided based upon topology. For example, the first 2N diagonal blocks of (|)[^4 = 
can be separated into two groups based upon the size of the sub-matrices and (|)2. 
That is, 

c^ri4 ('•''•) =^ri2 ('•''■) - {-<i?il2i>-^2N)B2NJi2(ifil2i^,2NV) , 
r<2N, 

can be separated for r <N as: 

<i?iU>■^r)^^-'{r,r)-i^[^-\N^)BlJ22<^-\r,Nrf) ■B2NJn- 
{[<!f2\N,\)BlJ22ry\r,NY]), 

ryUr + N.r + N)=^2\r,r)-(y(^-\N,r)+^^\NA)BlJ2x^^\l,r)Y) ■B2nJi2- 
( '{N,r)+,^^\NA)BlJ2i<^^\l,r)]). 

Thus, the modifications to the diagonal entries can be written as just a function of the 
first block row and last block column from the individual sub-matrices, using the matrix 
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map framework introduced in (IT4l for r <N : 



([Mi;,^^i(l,r)+M2;i(^r'('-'^)n)' 



1^4 (r + A', r + A?) - ^2 ' (r, r) - ( [Ml ;2C^2 ' ( 1 , r) + M2;2^2 H'", A^)] ^) ' 'B2iv/i2 • 

([Mi;2^2 +^2;2^2'('-,A^)])- 



Here, the matrix maps are assumed to have been updated based upon the formation of 
the combined inverses (|)j^2 ™d 'l'3-!^4- Therefore, we can begin to formulate the recur- 
sive framework for updating the matrix maps to represent the effect of each combining 
step. 

A.3 Update Scheme for Parallel Inversion and the Distributed Com- 
pact Representation 

The procedure begins with each division of the problem being assigned to one of p 
available CPUs. In addition, all of the p — I bridge matrices are made available to 
each of the CPUs. After the compact representation for each inverse has been found 
independently, the combining process begins. Three reference positions are defined 
for the formation of a combined inverse '^J'Jj' the "start" position [st] = /, the "stop" 

position [sp] — j, and the bridge position [bp] = [^]. Due to the fact that a CPU t 
will only be involved in the formulation of a combined inverse when [st] < t < [sp] 
all combining stages on the same level (see Figure [U can be performed concurrently. 
When forming a combined inverse (j),^'^, each CPU [st] < f < [sp] will first need to form 
the adjustment matrix for the combining step. Assuming a bridge matrix B^, we begin 
by constructing four "corner blocks". If the upper combined inverse is assumed to have 
Nu blocks and the lower to have Ni, the two matrices need from the upper combined 
inverse are: [UR] = 'I'jsij^jbpjlliMi) and [LR] = (j^jstj^ibp] (^«'^")' with the two matrices 
from the lower being: [UL] = ^l^jbp+ij^tspjCl, 1) and [LL] ^ *l)[bp+i]^[3p] (A^/, !)■ These 
matrices can be generated by the appropriate CPU through their respective matrix maps 
(recall the example shown in Figure |6}. Specifically, the CPUs corresponding to the 
[st] , [bp] , [bp + 1] and [sp] divisions govern the required information. The adjustment 
matrix for the combining step can then be formed: 



After the adjustment matrix has been calculated the process of updating the matrix 
maps can begin. For any combining step, the cross maps each CPU t must be updated 
first: 




/ 

[LR]Bk 
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if (? < [bp]) then 
Ci ^Ci-Ml.,{BkJn)Mi.^+\\ 
C2^C2- Ml,{BkJn)M4;,+i ; 
C3^C3-Ml,{BkJi2)M3,t+i; 

C4^C4-Ml,{BkJn)M4;,+i; 
elseif (f==[bp]) then (18) 

Ci^Ci-Ml,{Bi,Jn)Mi,,+i; 

Ci^Ci- Ml, {BkJi 1 )M2;t+i ; 

C3^C3-Mlj{BkJn)Mi,t+i; 

C4 ^ C4 -Mj.,(Bfcyii)M2;,+i; 
elseif {t < [sp]) then 

Ci^Ci-Mf.,(472i)Mi;,+i; 

C2^C2-M[.,(B[/2i)M2;r+i; 

C3 ^C3-MJ.,(b[72i)Mi;,+i; 

C4 ^ C4-Mlj{BlJ2i)M2;t+i; 

Notice that the cross maps for CPU t are dependent on information from its neighbor- 
ing CPU t + 1. This information must be transmitted and made available before the 
cross updates can be performed. Next, updates to the remaining eight matrix maps 
can be separated into two categories. The updates to the matrix maps for the upper 
sub-matrices {t < [bp]), are summarized below: 



Ms 
Mj 
Ml 

M2 

My, 

M4 



^Ms;t-Ml,{BkJn)M3;t; 
^M(,,t-Ml.,(BkJn)M4;t\ 
^MT,t-Ml.,{BkJn)M3;t; 
^Ms;t-Ml.,{Bi,Jn)M4-,; 
^Mi,, + {[m]BkJn)M3-t; 

^M2;t + {[m]BkJl2)M4r, 

^ {[LLfBlj22)M3-/, 
^ {[LLfBlJ22)M4;,; 



(19) 
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The updates to the matrix maps for the lower sub-matrices [t > [bp]), will be: 



Me-, ^ Me: 

M3;( ^ M3: 



-j-Mj.,{Bjj2i)Mi-y, 
-.r-Ml,{BlJ2i)M2-y, 
-,r-Ml,{BlJ2i)Mi-y, 
-j-Ml,{BlJ2i)M2-y 
:, + ([LL]^b[72i)Mi;,; 



(20) 



M4;, ^ M4;, + ([LL]^B[72i)M2;r; 
Ml;, ^ ([UR]Bi7ii)Mi;,; 
M2;, ^([UR]BaJii)M2;,; 



The above procedure, shown in (fTS^-dSOt. for modifying the matrix maps can be re- 
cursively repeated for each of the combining stages beginning with the lowest level of 
combining the individual sub-matrix inverses. 

On completion the maps can then be used to generate the block tridiagonal en- 
tries of G^. This subsequently allows for the computation of the generator sequences 
for G*, via the relationships shown in in a purely distributed fashion. The time 
complexity of the algorithm presented is 0{N^Ny/ p + N^logp), with memory con- 
sumption O (N^Ny/ p +N^) ■ The first term {N^Ny/p) in the computational complexity 
arises from the embarrassingly parallel nature of both determining the generator se- 
quences and applying the matrix maps to update the block tridiagonal portion of the 
inverse. The second term (A^^logp) is dependent on the number of levels needed to 
gather combining information for p sub-matrix inverses. Similarly, the first term in 
the memory complexity is due to the generator sequences and diagonal blocks, and 
the second represents the memory required for the matrix maps of each sub-problem 
governed. 
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[UR] 




T[st] ~[bp] 








[LR] 






[UL] 






1 


-1 

[bp+l]~[sp] 




[LL] 





